- Introduce the neural network design
- Fit logistic regression in R by using glm
- Implement all calculations in Python
- Build the same model by using Tensorflow
- Build the same model by using PyTorch
2023
Figure. Neural network design for logistic regression
\(x \mapsto z^1_{1}=w^1_{11}x+b^1_1\).
The neuron then outputs \(a^1_1=\sigma(z^1_{1})\). Here, \(\sigma=logit^{-1}\), the sigmoid function.
We would like our only neuron to fire 1 when the input is 5, 11, or 13 and to fire 0 when the input is 2, 3, or 7.
This has to be achieved by choosing values for \(w^1_{11}\) and \(b^1_1\). Is that possible? Unfortunately, no: \(\sigma(w^1_{11}x+b^1_1)\) is monotone in \(x\), but the desired output is 0 at \(x=7\) even though it is 1 at both \(x=5\) and \(x=11\). Instead, we will choose \(w^1_{11}\) and \(b^1_1\) so that the model performs as close as it can to what is desired.
training_data <- data.frame(x = c(2, 3, 5, 7, 11, 13), y = c(0, 0, 1, 0, 1, 1))
log_reg.res <- glm(data = training_data, y ~ x, family = binomial)
w <- coef(summary(log_reg.res))[2, 1]  # slope estimate
b <- coef(summary(log_reg.res))[1, 1]  # intercept estimate
‘glm’ fits generalized linear models; with family binomial, it performs logistic regression.
\(w=w^1_{11}=0.5518\) and \(b=b^1_1=-3.5491\) are found by glm.
library(ggplot2); library(boot)  # inv.logit comes from the boot package
x1 <- seq(0, 13, by = 0.2); y1 <- inv.logit(w * x1 + b)
ggplot(data = training_data, aes(x = x, y = y)) + geom_point() +
  geom_line(data = data.frame(x = x1, y = y1), aes(x = x1, y = y1)) +
  labs(title = paste0("w = ", round(w, 4), " b = ", round(b, 4)))
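The same fit can be reproduced in Python. A minimal sketch, assuming NumPy and SciPy are available, that minimizes the binary cross entropy directly:

```python
import numpy as np
from scipy.optimize import minimize

# Toy training data: fire 1 for 5, 11, 13 and 0 for 2, 3, 7.
x = np.array([2., 3., 5., 7., 11., 13.])
y = np.array([0., 0., 1., 0., 1., 1.])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(params):
    w, b = params
    a = sigmoid(w * x + b)
    a = np.clip(a, 1e-12, 1 - 1e-12)  # guard the logs during line search
    # Binary cross entropy summed over the six points.
    return -np.sum(y * np.log(a) + (1 - y) * np.log(1 - a))

res = minimize(cost, x0=[0.0, 0.0], method="BFGS")
w_hat, b_hat = res.x  # should be close to glm's 0.5518 and -3.5491
```

Since the problem is convex, BFGS from the origin reaches the same optimum that glm finds by iteratively reweighted least squares.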
Outputs \(a^{l-1}_k\) from the previous layer (\(l-1\)) come to the \(j\)-th neuron in Layer \(l\). First, \(z^l_j\) is calculated by \(z^l_{j} =b^l_j + \sum_k w^l_{jk} a^{l-1}_k\). Deploying vectors and matrices, \[z^l = w^la^{l-1}+b^l\] where \(z^l=[z^l_1\; z^l_2 \, \dots \, z^l_m]^T\), \(w^l=[w^l_{jk}]_{1\leq j\leq m, 1\leq k\leq n}\), \(a^{l-1}=[\sigma(z^{l-1}_1)\; \sigma(z^{l-1}_2) \, \dots \, \sigma(z^{l-1}_n)]^T\), and \(b^l=[b^l_1\; b^l_2 \, \dots \, b^l_m]^T\).
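The vectorized layer rule above can be sketched in NumPy; the layer sizes here are arbitrary, chosen only for the example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n, m = 3, 2                                # previous layer has n neurons, current layer has m
a_prev = sigmoid(rng.normal(size=(n, 1)))  # a^{l-1}, an n-vector
W = rng.normal(size=(m, n))                # w^l, an m x n matrix
b = rng.normal(size=(m, 1))                # b^l, an m-vector

z = W @ a_prev + b   # z^l = w^l a^{l-1} + b^l
a = sigmoid(z)       # a^l, fed to the next layer

# Componentwise check against z^l_j = b^l_j + sum_k w^l_{jk} a^{l-1}_k
z0 = b[0, 0] + sum(W[0, k] * a_prev[k, 0] for k in range(n))
```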
After the last layer, there will be a cost function comparing the last output of the model against the desired output. Searching for \(w\)’s and \(b\)’s that minimize the cost function is called learning. One popular method for this optimization is gradient descent. For this, we need the gradient of the cost function as a function of the \(w\)’s and \(b\)’s.
The cost function for the logistic regression that we are building is \[ C(w,b) = -\sum_{i=1}^6 \Bigl[\, y_i \ln\bigl( logit^{-1}(wx_i+b) \bigr) + (1-y_i) \ln\bigl(1-logit^{-1}(wx_i+b)\bigr) \Bigr].\] It is also called binary cross entropy.
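As a sanity check, this cost can be evaluated at the coefficients glm found (values from above, rounded to four decimals):

```python
import numpy as np

x = np.array([2., 3., 5., 7., 11., 13.])
y = np.array([0., 0., 1., 0., 1., 1.])
w, b = 0.5518, -3.5491  # glm's estimates, rounded

a = 1.0 / (1.0 + np.exp(-(w * x + b)))                # logit^{-1}(w x + b)
C = -np.sum(y * np.log(a) + (1 - y) * np.log(1 - a))  # binary cross entropy
```

The minimized cost is finite but not zero, reflecting that the six points cannot be classified perfectly.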
The logistic regression model that we are building as an artificial neural network is simple enough to calculate the gradient just by following the basic gradient formula.
In general, however, it can be challenging when there are more layers and more neurons. For that, there is an algorithmic approach called back propagation. First, define \(\delta^l_j=\frac{\partial C}{\partial z^l_j}\), and let Layer \(L\) be the last layer.
BP1: \(\delta^L_j=\frac{\partial C}{\partial a^L_j}\sigma'(z^L_j)\)
BP2: \(\delta^l=\bigl( (w^{l+1})^T \delta^{l+1}\bigr) \odot \sigma '(z^l) \text{ for } l<L. \;\;\odot: \text{Hadamard product}\)
BP3: \(\frac{\partial C}{\partial b^l_j}=\delta^l_j\)
BP4: \(\frac{\partial C}{\partial w^l_{jk}}=a^{l-1}_k\delta^l_j\)
Back propagation is essentially a few layers of chain-rule calculation, and it lends itself well to a step-by-step implementation of the gradient computation.
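To see BP1 through BP4 in action, here is a NumPy sketch for a hypothetical two-layer network (the sizes and inputs are made up for illustration) that computes all gradients with the four formulas and checks one of them against a finite difference:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dsigmoid(z):
    s = sigmoid(z)
    return s * (1 - s)

rng = np.random.default_rng(1)
# Layer sizes: 1 input, 2 hidden neurons, 1 output neuron.
W1, b1 = rng.normal(size=(2, 1)), rng.normal(size=(2, 1))
W2, b2 = rng.normal(size=(1, 2)), rng.normal(size=(1, 1))
x, y = np.array([[0.7]]), np.array([[1.0]])

def forward(W1, b1, W2, b2):
    z1 = W1 @ x + b1; a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2; a2 = sigmoid(z2)
    return z1, a1, z2, a2

def cost(W1, b1, W2, b2):
    a2 = forward(W1, b1, W2, b2)[3]
    return (-y * np.log(a2) - (1 - y) * np.log(1 - a2)).item()

z1, a1, z2, a2 = forward(W1, b1, W2, b2)
dC_da2 = -y / a2 + (1 - y) / (1 - a2)
delta2 = dC_da2 * dsigmoid(z2)                  # BP1
delta1 = (W2.T @ delta2) * dsigmoid(z1)         # BP2
grad_b2, grad_b1 = delta2, delta1               # BP3
grad_W2, grad_W1 = delta2 @ a1.T, delta1 @ x.T  # BP4

# Finite-difference check on one weight, W1[0, 0].
eps = 1e-6
Wp = W1.copy(); Wp[0, 0] += eps
num = (cost(Wp, b1, W2, b2) - cost(W1, b1, W2, b2)) / eps
```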
Although our neural network model for logistic regression has only one layer (which is therefore the last), let's trace the back propagation formulas through the calculation.
The cost function, \(C\), is a sum of terms \(-y\ln(a)-(1-y)\ln(1-a)\) over several \(y\) and \(x\) values. So, let's just work with one such term as \(C\), that is, \(C(w,b)=-y\ln(a)-(1-y)\ln(1-a)\), dropping the layer and neuron indices for simplicity. Remember that \(z=wx+b\) and \(a=\sigma(z)\).
\(\frac{\partial C}{\partial w}=-y\frac{1}{a}\frac{\partial a}{\partial w}-(1-y)\frac{-1}{1-a}\frac{\partial a}{\partial w}\)
\(\;\;\;\;\;\;=-y\frac{1}{a}\sigma'(z)\frac{\partial z}{\partial w}-(1-y)\frac{-1}{1-a}\sigma'(z)\frac{\partial z}{\partial w}\)
\(\;\;\;\;\;\;=-y\frac{1}{a}\sigma'(z)x-(1-y)\frac{-1}{1-a}\sigma'(z)x\)
\(\;\;\;\;\;\;=-\bigl(y\frac{1}{a}\sigma'(z)+(1-y)\frac{-1}{1-a}\sigma'(z) \bigr)x\)
\(\;\;\;\;\;\;=-\bigl(y\frac{1}{a}\sigma'(z)+(1-y)\frac{-1}{1-a}\sigma'(z) \bigr)a^0_1\)
\(\;\;\;\;\;\;= \frac{\partial C}{\partial z}a^0_1=\frac{\partial C}{\partial z^1_1}a^0_1=\delta^1_1a^0_1.\;\;\) This is BP4.
\(\frac{\partial C}{\partial b}=-y\frac{1}{a}\frac{\partial a}{\partial b}-(1-y)\frac{-1}{1-a}\frac{\partial a}{\partial b}\)
\(\;\;\;\;\;\;=-y\frac{1}{a}\sigma'(z)\frac{\partial z}{\partial b}-(1-y)\frac{-1}{1-a}\sigma'(z)\frac{\partial z}{\partial b}\)
\(\;\;\;\;\;\;=-y\frac{1}{a}\sigma'(z)-(1-y)\frac{-1}{1-a}\sigma'(z)\)
\(\;\;\;\;\;\;=\frac{\partial C}{\partial z^1_1}=\delta^1_1,\)
this is BP3.
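Since \(\sigma'(z)=\sigma(z)(1-\sigma(z))\), these derivatives simplify to \(\delta^1_1=a-y\), so \(\partial C/\partial w=(a-y)x\) and \(\partial C/\partial b=a-y\). A quick NumPy sketch checking BP3 and BP4 against finite differences at arbitrary example values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def C(w, b, x, y):
    a = sigmoid(w * x + b)
    return -y * np.log(a) - (1 - y) * np.log(1 - a)

# Arbitrary example values for the check.
w, b, x, y = 0.5518, -3.5491, 5.0, 1.0
a = sigmoid(w * x + b)
delta = a - y       # delta^1_1 = dC/dz
grad_w = delta * x  # BP4: dC/dw = a^0_1 * delta, with a^0_1 = x
grad_b = delta      # BP3: dC/db = delta

eps = 1e-6
num_w = (C(w + eps, b, x, y) - C(w, b, x, y)) / eps
num_b = (C(w, b + eps, x, y) - C(w, b, x, y)) / eps
```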
The implementation in basic Python code becomes complicated as there are more layers and more neurons. Let's now look at the usage of the tensorflow package. A layer can be added with one extra line, and the number of neurons does not add any complexity to the code.
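The slide's TensorFlow code is not reproduced here; the following Keras sketch fits the same one-neuron model, where the optimizer, learning rate, and epoch count are my own assumed choices:

```python
import numpy as np
import tensorflow as tf

# Training data from the slides.
x = np.array([[2.], [3.], [5.], [7.], [11.], [13.]], dtype="float32")
y = np.array([[0.], [0.], [1.], [0.], [1.], [1.]], dtype="float32")

# One Dense layer with one neuron and a sigmoid activation:
# exactly the logistic regression model built above.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.1),
              loss="binary_crossentropy")
model.fit(x, y, epochs=500, verbose=0)

W, B = model.get_weights()  # should approach glm's w = 0.5518, b = -3.5491
```

Adding a second layer would only require one more `Dense(...)` line inside the `Sequential` list.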
The results are the same (okay, almost the same).
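The PyTorch version can be sketched under the same assumptions (a single linear neuron trained with binary cross entropy; the hyperparameters are again my own choices):

```python
import torch

x = torch.tensor([[2.], [3.], [5.], [7.], [11.], [13.]])
y = torch.tensor([[0.], [0.], [1.], [0.], [1.], [1.]])

# One linear neuron; the sigmoid is folded into BCEWithLogitsLoss
# for numerical stability.
model = torch.nn.Linear(1, 1)
loss_fn = torch.nn.BCEWithLogitsLoss(reduction="sum")
opt = torch.optim.Adam(model.parameters(), lr=0.1)

for _ in range(500):
    opt.zero_grad()
    loss = loss_fn(model(x), y)  # logits in, sigmoid + BCE inside
    loss.backward()              # back propagation
    opt.step()

w = model.weight.item()  # should approach glm's 0.5518
b = model.bias.item()    # should approach glm's -3.5491
```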